\[~\]

\[~\]

\[~\]

The Introduction

\[~\]

Nowadays, thanks to technological improvements, we have access to amounts of data that allow us to understand whether or not there is a problem with a person's health.

In this project we explore the data and try to predict whether a patient should be diagnosed with heart disease or not.

\[~\]

The Dataset

\[~\]

\[~\]

Kaggle is the platform where I found this interesting dataset, to which we apply Bayesian inference as the main scope of this project.

This dataset is composed of the following features:

## 
## -- Column specification --------------------------------------------------------
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trtbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalachh = col_double(),
##   exng = col_double(),
##   oldpeak = col_double(),
##   slp = col_double(),
##   caa = col_double(),
##   thall = col_double(),
##   output = col_double()
## )
## # A tibble: 6 x 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3    145   233     1       0      150     0     2.3     0
## 2    37     1     2    130   250     0       1      187     0     3.5     0
## 3    41     0     1    130   204     0       0      172     0     1.4     2
## 4    56     1     1    120   236     0       1      178     0     0.8     2
## 5    57     0     0    120   354     0       1      163     1     0.6     2
## 6    57     1     0    140   192     0       1      148     0     0.4     1
## # ... with 3 more variables: caa <dbl>, thall <dbl>, output <dbl>

\[~\]

Analyzing these features, we can describe them as follows:

  1. age -> the age of the patient
  2. sex -> the sex of the patient:
    • male (1)
    • female (0)
  3. cp -> the chest pain type, which can be of four types:
    • 0 is the typical angina
    • 1 is the atypical angina
    • 2 is the non-anginal pain
    • 3 is the asymptomatic type
  4. trtbps -> the resting blood pressure
  5. chol -> the serum cholesterol in mg/dl
  6. fbs -> the fasting blood sugar, coded by whether it is larger than 120 mg/dl:
    • 1 (true)
    • 0 (false)
  7. restecg -> the resting electrocardiographic results:
    • 0 as normal
    • 1 as having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2 as showing probable or definite left ventricular hypertrophy by Estes' criteria
  8. thalachh -> the maximum heart rate achieved
  9. exng -> exercise-induced angina:
    • 1 (yes)
    • 0 (no)
  10. oldpeak -> the previous peak (ST depression induced by exercise relative to rest)
  11. slp -> the slope of the peak exercise ST segment:
    • 0 as up sloping
    • 1 as flat
    • 2 as down sloping
  12. caa -> the number of major vessels, from 0 to 3
  13. thall -> the thallium stress test result:
    • 0 as no data
    • 1 as normal
    • 2 as fixed defect
    • 3 as reversible defect
  14. output -> whether the patient had a heart attack (1) or not (0)
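Since most of these columns are integer codes, it can help to attach readable labels to them before plotting. A minimal sketch, using the column names from the dataset and the labels listed above (the two-row toy data frame is only for self-containment; in the real analysis `dat` would hold the full dataset):

```r
# Sketch: give readable labels to a few coded columns (toy data for illustration)
dat <- data.frame(sex = c(1, 0), cp = c(3, 0), output = c(1, 0))
dat$sex <- factor(dat$sex, levels = c(0, 1), labels = c("female", "male"))
dat$cp  <- factor(dat$cp, levels = 0:3,
                  labels = c("typical angina", "atypical angina",
                             "non-anginal pain", "asymptomatic"))
```

With factors in place, functions like `summary()` and the plotting helpers report category names instead of bare numbers.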

\[~\]

The main summary statistics of the features are:

\[~\]

##       age             sex               cp             trtbps     
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.42   Mean   :0.6821   Mean   :0.9636   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol            fbs           restecg          thalachh    
##  Min.   :126.0   Min.   :0.000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :240.5   Median :0.000   Median :1.0000   Median :152.5  
##  Mean   :246.5   Mean   :0.149   Mean   :0.5265   Mean   :149.6  
##  3rd Qu.:274.8   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.000   Max.   :2.0000   Max.   :202.0  
##       exng           oldpeak           slp             caa        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3278   Mean   :1.043   Mean   :1.397   Mean   :0.7185  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##      thall           output     
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.000  
##  Median :2.000   Median :1.000  
##  Mean   :2.315   Mean   :0.543  
##  3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.000

\[~\] As we can see above:

  • the people in this dataset are between 29 and 77 years old, so evidently there are no teenagers in the dataset; this makes sense, since teenagers have a low probability of suffering a heart attack at a young age.
  • the sexes are not equally distributed: the mean of the sex column is about 0.68, so roughly 68% of the patients are male.
  • for the output feature, we observe a roughly even split between people who had a heart attack and people who did not (mean about 0.54).

The other features of this dataset can easily be read from this summary, so let's move on to data visualization to get a better view of the data we are treating!

\[~\]

The Data Visualization

\[~\]

In this subsection we describe the dataset graphically, to get a first taste of which model is better suited to this case. We show some interesting plots below:

\[~\]

The visualization using PCA

\[~\]

PCA is a tool that allows us to show the behaviour of the patients (treated as points in this space). In this way we can see which patients are most similar to each other.

\[~\]

\[~\]

It is natural to note that this representation loses some information because of the number of features we are considering, but here we actually recover 89.8% of the variance of the entire dataset using two principal components, which is a good result.
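The computation behind this plot can be sketched with base R's `prcomp`; the simulated stand-in data here only keep the snippet self-contained (random noise will of course recover far less variance in two components than the real, correlated dataset does):

```r
# Sketch of the PCA step, assuming the 13 predictors are numeric columns
# of a data frame (simulated stand-in data for self-containment)
set.seed(42)
dat_sim <- as.data.frame(matrix(rnorm(302 * 13), ncol = 13))
pca <- prcomp(dat_sim, center = TRUE, scale. = TRUE)
# Cumulative share of variance captured by the first two components
explained <- summary(pca)$importance["Cumulative Proportion", 2]
```

Plotting `pca$x[, 1:2]`, colored by the output column, gives the patient map described above.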

\[~\]

The Categorical Data

\[~\]

Here, we want to analyze the percentage distribution of each categorical feature:

\[~\]

The most relevant is the typical angina (47.4%), which is defined as substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin (it represents inadequate flow of blood and oxygen to the heart). Women and elderly patients usually have atypical symptoms both at rest and during stress, often in the setting of nonobstructive coronary artery disease (CAD).

\[~\]

The ST/heart rate slope (ST/HR slope) has been proposed as a more accurate ECG criterion for diagnosing significant coronary artery disease (CAD). The most frequent slope values fall approximately between the first and second types (flat and down sloping).

\[~\]

The most common case is 0 major vessels affected by any damage, but in a smaller number of cases we see that some major vessels within the heart do have a problem.

\[~\]

The most common thallium stress test result is a fixed defect.

\[~\]

Angina may feel like pressure in the chest, jaw or arm. It frequently occurs with exercise or stress. Some people with angina also report feeling lightheaded, overly tired, short of breath or nauseated. As the heart pumps harder to keep up with what you are doing, it needs more oxygen-rich blood. In our data, however, most patients do not have exercise-induced angina.

\[~\]

\[~\]

The Quantitative Data

\[~\]

Here, after having seen the categorical data in the previous section, we illustrate the main features of the quantitative data:

\[~\]

hist.and.summary('age', 'Persons Age')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   55.50   54.42   61.00   77.00
hist.and.summary('chol', 'Cholestoral')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   240.5   246.5   274.8   564.0
hist.and.summary('thalachh', 'Maximum Heart Rate Achieved')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.2   152.5   149.6   166.0   202.0
hist.and.summary('trtbps', 'Resting Blood Pressure')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
hist.and.summary('oldpeak', 'Previous Peak')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.800   1.043   1.600   6.200
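`hist.and.summary` is a helper defined earlier in the source but not shown here; a minimal sketch consistent with the output above, assuming the dataset lives in a data frame named `dat` (toy data used for self-containment):

```r
# Assumed sketch of the hist.and.summary helper: draws a histogram of one
# column of `dat` and returns its six-number summary
dat <- data.frame(age = c(63, 37, 41, 56, 57, 57))  # toy data for illustration
hist.and.summary <- function(col, title) {
  hist(dat[[col]], main = title, xlab = title, col = "steelblue")
  summary(dat[[col]])
}
hist.and.summary('age', 'Persons Age')
```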

\[~\] As we can see above, the distributions of these data differ, and some are more similar than others; for example, if you compare the maximum heart rate and the age of the people, the distributions are clearly different! These considerations could be useful for predicting a heart attack.

After this, we also consider the densities, distinguishing the cases of having a heart attack or not:

\[~\]

dense.chart('age', 'Persons Age')
dense.chart('chol', 'Cholestoral')
dense.chart('thalachh', 'Maximum Heart Rate Achieved')
dense.chart('trtbps', 'Resting Blood Pressure')
dense.chart('oldpeak', 'Previous Peak')
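`dense.chart` is likewise a helper from the source that is not shown; a base-R sketch of what it presumably does, assuming `dat` has a 0/1 `output` column (toy data keep the snippet self-contained):

```r
# Assumed sketch of dense.chart: density of one column of `dat`,
# split by heart-attack outcome
set.seed(1)
dat <- data.frame(age = rnorm(100, 55, 9), output = rbinom(100, 1, 0.5))
dense.chart <- function(col, title) {
  d0 <- density(dat[[col]][dat$output == 0])
  d1 <- density(dat[[col]][dat$output == 1])
  plot(d1, main = title, xlab = title, col = "red",
       ylim = range(d0$y, d1$y))
  lines(d0, col = "blue")
  legend("topright", c("heart attack", "no heart attack"),
         col = c("red", "blue"), lty = 1)
}
dense.chart('age', 'Persons Age')
```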

The Correlations

\[~\]

Here, we want to highlight the correlations between the features, treated as qualitative, as quantitative, and without any distinction.

\[~\]

With these correlation maps we see some interesting correlations. It is important to note that the quantitative variables are more informative here than the qualitative ones, since correlation is a quantitative measure, but we can still highlight and consider the qualitative variables as well. In the complete map we note that:

  • thalachh and age \(\rightarrow\) have a correlation of -0.40
  • cp and exng \(\rightarrow\) have a correlation of -0.39
  • thalachh and exng \(\rightarrow\) have a correlation of -0.38
  • slp and oldpeak \(\rightarrow\) have a correlation of -0.58

It seems that these variables will be important in our predictions later; we will see in a moment whether that is the case!
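The correlation maps boil down to a single `cor()` call on the numeric columns; a minimal sketch with two toy columns standing in for the full dataset:

```r
# Sketch of the correlation-map computation, assuming all columns of `dat`
# are numeric (toy data for self-containment)
set.seed(2)
dat <- data.frame(thalachh = rnorm(50), age = rnorm(50))
corr_mat <- cor(dat)                    # full Pearson correlation matrix
round(corr_mat["thalachh", "age"], 2)   # one pairwise correlation
```

Feeding `corr_mat` to a heatmap function then gives the maps shown above.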

\[~\]

Preliminary brief definitions

\[~\]

\[~\]

The Goal

\[~\]

The main goal is to carry out a fully Bayesian analysis aimed at understanding whether a person will have a heart attack or not.

So, the response variable \(Y_i\) (the “output” feature in our dataset) is the heart attack indicator \(\in \{0,1\}\), and the predictor variables (\(x_i \in \mathbb{R}^{+}\)) were chosen using the glm() function (we are cheating a little, but this way we can be sure we are catching the right variables), which fits generalized linear models, as we can see below:

\[~\]

## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x13, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5751  -0.3868   0.1514   0.5841   2.6239  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.304169   2.577891   1.282 0.199936    
## x1          -0.001469   0.023464  -0.063 0.950062    
## x2          -1.750930   0.468199  -3.740 0.000184 ***
## x3           0.847283   0.185548   4.566 4.96e-06 ***
## x4          -0.020188   0.010386  -1.944 0.051916 .  
## x5          -0.004489   0.003806  -1.179 0.238252    
## x6           0.073463   0.532452   0.138 0.890263    
## x7           0.450607   0.348505   1.293 0.196021    
## x8           0.023134   0.010449   2.214 0.026835 *  
## x9          -0.981017   0.409806  -2.394 0.016672 *  
## x10         -0.523604   0.214467  -2.441 0.014630 *  
## x11          0.589074   0.349864   1.684 0.092236 .  
## x12         -0.826015   0.201922  -4.091 4.30e-05 ***
## x13         -0.887203   0.290730  -3.052 0.002276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 210.35  on 288  degrees of freedom
## AIC: 238.35
## 
## Number of Fisher Scoring iterations: 6

\[~\]

As we can see above, the glm rejects the hypothesis of considering the variables:

… and the intercept is not a good choice to admit into our model either.

An important remark is that we tried the frequentist approach on these data, and we should remember that, considering all of the features, we achieved an AIC value of 238.35.

But what if we try with only the features that we decide to keep in the next steps?

summary(glm(y ~ x2+x3+x4+x8+x9+x10+x11+x12+x13, family = binomial(link = "logit"),data=dat))
## 
## Call:
## glm(formula = y ~ x2 + x3 + x4 + x8 + x9 + x10 + x11 + x12 + 
##     x13, family = binomial(link = "logit"), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5157  -0.3801   0.1652   0.5941   2.6269  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.554313   1.924575   1.327 0.184440    
## x2          -1.598915   0.433172  -3.691 0.000223 ***
## x3           0.843330   0.180454   4.673 2.96e-06 ***
## x4          -0.021336   0.009893  -2.157 0.031023 *  
## x8           0.022131   0.009334   2.371 0.017738 *  
## x9          -0.951980   0.401449  -2.371 0.017723 *  
## x10         -0.514275   0.208059  -2.472 0.013444 *  
## x11          0.611946   0.341449   1.792 0.073101 .  
## x12         -0.816831   0.194912  -4.191 2.78e-05 ***
## x13         -0.902166   0.279919  -3.223 0.001269 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 214.19  on 292  degrees of freedom
## AIC: 234.19
## 
## Number of Fisher Scoring iterations: 6

The behaviour changes slightly: we obtain an AIC value of 234.19. We will see whether we achieve the same result!

In summary, we have N = 302 observations. We decided to consider two models and compare them to see which one is the best model of our analysis.

\[~\]

The Second Model

\[~\] Now, we want to focus on another model: still a logistic regression with all the features, but with a new link function, the cloglog function.

Why this change? Because we want to see which model is better… and how can you see this? I'll tell you where you can easily see this preference.

What is the link function? The link function transforms the probabilities of the levels of a categorical response variable to a continuous, unbounded scale. Once the transformation is done, the relationship between the predictors and the response can be modeled, for example with logistic regression.
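The effect of swapping the link can be previewed quickly in the frequentist setting with glm, before the full Bayesian fit; the data here are simulated so the snippet stands alone:

```r
# Sketch: the same binary-response data fitted with logit vs cloglog links
set.seed(3)
n <- 302
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x))
fit_logit <- glm(y ~ x, family = binomial(link = "logit"))
fit_clog  <- glm(y ~ x, family = binomial(link = "cloglog"))
c(logit = AIC(fit_logit), cloglog = AIC(fit_clog))  # lower AIC is preferred
```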

As announced, we reproduce this second model changing only the link function, to see whether it fits better than the previous model:

\[~\]

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 302
##    Unobserved stochastic nodes: 13
##    Total graph size: 5499
## 
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/Rtmp4GRtjV/model233c4ac84acc.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
##  n.sims = 2700 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## beta1      0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.003  1100
## beta10    -0.368   0.131  -0.634  -0.455  -0.367  -0.278  -0.123 1.001  2400
## beta11     0.233   0.207  -0.177   0.091   0.231   0.370   0.642 1.003   700
## beta12    -0.019   0.013  -0.046  -0.028  -0.019  -0.010   0.006 1.002  1800
## beta13    -0.563   0.168  -0.897  -0.673  -0.560  -0.453  -0.231 1.002  1100
## beta2     -1.135   0.249  -1.638  -1.310  -1.131  -0.961  -0.658 1.001  2700
## beta3      0.528   0.108   0.314   0.456   0.526   0.599   0.744 1.002  1500
## beta4      0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
## beta5      0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
## beta6     -0.044   0.289  -0.631  -0.238  -0.041   0.147   0.513 1.001  2700
## beta7      0.347   0.199  -0.043   0.215   0.349   0.479   0.742 1.001  2700
## beta8      0.017   0.004   0.010   0.014   0.017   0.019   0.024 1.007   750
## beta9     -0.594   0.276  -1.147  -0.783  -0.587  -0.412  -0.067 1.001  2500
## deviance 246.342   5.059 238.164 242.551 245.727 249.433 258.247 1.000  2700
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 12.8 and DIC = 259.1
## DIC is an estimate of expected predictive error (lower deviance is better).

\[~\]

Can you see something different? It seems so… look at the DIC! The previous model is better (although the difference is very small).

…mmh, what is the DIC?

\[~\]

The comparison between these two models (DIC)

\[~\]

As we saw above, we built a different model, changing only the link function, to see whether it matches our data better; the DIC (Deviance Information Criterion) is the indicator that tells us which model is better.

The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC.

The DIC is calculated as:

\[ DIC = p_D + \overline{D(\theta)} \]

where \(p_D\) is the effective number of parameters and \(\overline{D(\theta)}\) is the posterior mean of the deviance.

The larger the effective number of parameters, the easier it is for the model to fit the data, and so the deviance needs to be penalized.

The lower the DIC value, the better the model; in this case, the first model is better.
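Using the same rule reported in the JAGS output above (pD = var(deviance)/2), the DIC can be computed directly from posterior deviance draws; the toy draws below only mimic the scale of the first model's output:

```r
# DIC from posterior deviance samples, following the rule pD = var(deviance)/2
dic <- function(dev) {
  pD <- var(dev) / 2      # effective number of parameters
  mean(dev) + pD          # DIC = D-bar + pD
}
# Toy deviance draws standing in for the saved MCMC samples
set.seed(4)
dev_draws <- rnorm(2700, mean = 246.3, sd = 5.1)
dic(dev_draws)
```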

\[~\]

The Conclusions

\[~\] We plot the logistic regression using the parameters obtained from the first model, comparing the fitted curve for each parameter against the response variable:

We prefer to show only the significant ones.

\[~\]

Simulation for parameter recovery

\[~\] Now we understand that the first model is better than the second. So, to check that the proposed model can correctly recover its parameters, I ran a simulation on data simulated from the proposed model; the beta parameters checked are those estimated from the first model:

  • \(\beta_2 \rightarrow -1.65\)
  • \(\beta_3 \rightarrow 0.88\)
  • \(\beta_4 \rightarrow -0.01\)
  • \(\beta_8 \rightarrow 0.03\)
  • \(\beta_9 \rightarrow -0.88\)
  • \(\beta_{10} \rightarrow -0.48\)
  • \(\beta_{11} \rightarrow 0.68\)
  • \(\beta_{12} \rightarrow -0.84\)
  • \(\beta_{13} \rightarrow -0.85\)
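The idea of the recovery check can be sketched in a frequentist setting: simulate responses from known betas, refit, and compare. The betas and predictors below are illustrative, not the model's full set:

```r
# Sketch of a parameter-recovery check for a logistic model
# (illustrative coefficients and simulated predictors)
set.seed(5)
N <- 302
x1 <- rnorm(N); x2 <- rnorm(N)
beta_true <- c(-1.65, 0.88)                      # known "true" values
p <- plogis(beta_true[1] * x1 + beta_true[2] * x2)
y <- rbinom(N, 1, p)                             # simulated responses
fit <- glm(y ~ x1 + x2 - 1, family = binomial(link = "logit"))
round(coef(fit), 2)  # estimates should land near -1.65 and 0.88
```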
## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 302
##    Unobserved stochastic nodes: 9
##    Total graph size: 3890
## 
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpOexlNl/model5bf4139040e9.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
##  n.sims = 2700 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## beta10    -0.417   0.134  -0.683  -0.508  -0.412  -0.324  -0.162 1.001  2700
## beta11     0.221   0.248  -0.252   0.053   0.221   0.389   0.706 1.003   880
## beta12    -0.007   0.267  -0.549  -0.184  -0.009   0.168   0.531 1.000  2700
## beta13    -0.796   0.252  -1.281  -0.960  -0.798  -0.631  -0.314 1.001  2700
## beta2      2.092   0.331   1.444   1.870   2.090   2.318   2.731 1.002  1700
## beta3      0.719   0.239   0.247   0.563   0.720   0.876   1.191 1.000  2700
## beta4     -0.018   0.006  -0.030  -0.022  -0.017  -0.013  -0.006 1.002  1600
## beta8      0.004   0.005  -0.006   0.000   0.004   0.007   0.014 1.002  1600
## beta9      1.213   0.437   0.355   0.929   1.211   1.493   2.098 1.001  2700
## deviance 308.850   4.467 302.412 305.585 308.098 311.195 319.392 1.002  1200
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 10.0 and DIC = 318.8
## DIC is an estimate of expected predictive error (lower deviance is better).

The values are approximately the same, so the chosen parameters are good!

\[~\]

… Final conclusions

\[~\]

We saw the strengths and weaknesses of Bayesian analysis.

In the first step we fit the model in JAGS, “cheating” a little about which parameters might be good for our purposes; after that we made our usual considerations and then compared with a second, less good model. Of course, many small things could be improved in the future in this analysis, which in my opinion turned out well.

So, thanks for reading!

\[~\]

The References

\[~\]

  1. Heart Attack Analysis & Prediction Dataset - [Kaggle] https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

  2. For graphical plots - [Kaggle] https://www.kaggle.com/chaitanya99/heart-attack-analysis-prediction/data

  3. Link functions, the model types - [Alteryx] https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Selecting-a-Logistic-Regression-Model-Type-Logit-Probit-or/ta-p/111269

  4. HighCharter, for the amazing plots in R - [High Charter] https://jkunst.com/highcharter/articles/hchart.html

  5. Coronary artery disease - [PubMed] https://pubmed.ncbi.nlm.nih.gov/3739881

  6. Nitroglycerin - [Wikipedia] https://en.wikipedia.org/wiki/Nitroglycerin

  7. Vessel Diseases - [Digirad] https://www.digirad.com/triple-vessel-disease-diagnosis-treatment/

  8. Different types of Angina - [CardioSmart] https://www.cardiosmart.org/topics/angina

  9. Logistic Regression Bayesian Model R - [BayesBall] https://bayesball.github.io/BOOK/bayesian-multiple-regression-and-logistic-models.html

  10. Typical Angina - [NCBI] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106

  11. RDocumentation - [RDocumentation] https://www.rdocumentation.org